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Abstract 

Nonlinear manifold learning algorithms, such as diffusion maps, have been fruitfully applied in recent years 
to the analysis of large and complex data sets. However, such algorithms still encounter challenges when 
faced with real data. One such challenge is the existence of “repeated eigendirections,” which obscures the 
detection of the true dimensionality of the underlying manifold and arises when several embedding coordi¬ 
nates parametrize the same direction in the intrinsic geometry of the data set. We propose an algorithm, 
based on local linear regression, to automatically detect coordinates corresponding to repeated eigendirec¬ 
tions. We construct a more parsimonious embedding using only the eigenvectors corresponding to unique 
eigendirections, and we show that this reduced diffusion maps embedding induces a metric which is equiv¬ 
alent to the standard diffusion distance. We first demonstrate the utility and flexibility of our approach 
on synthetic data sets. We then apply our algorithm to data collected from a stochastic model of cellular 
chemotaxis, where our approach for factoring out repeated eigendirections allows us to detect changes in 
dynamical behavior and the underlying intrinsic system dimensionality directly from data. 
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1. Introduction 


In recent years, data mining algorithms have proven useful for many disciplines and applications 0,i 
iasisi- For multiscale dynamical systems in particular, data-driven methodologies are essential 
for assisting in model reduction when simple macroscopic models cannot be analytically obtained due to 
the system complexity S E m Q. For simulations or experimental observations of such systems, the 
detailed, microscale evolving system state is very high-dimensional, and the reduction to useful macroscale 
dynamics is not obvious. In such cases, data obtained from observations and/or simulations of the dynamical 
system combined with data mining methodologies can lead to a low-dimensional description which not only 
provides insight into the underlying dynamics, but also serves as a first step in constructing macroscale 
models consistent with the observed microscale behavior. 

Several manifold learning algorithms obtain parametrizations of the data through the spectral analysis 
of a Laplace operator [l^, 14, [T^ 16 1- The data is then embedded in a new low-dimensional coordinate 


system given by the eigenvectors of this operator. The premise is that these coordinates, obtained in a data- 
driven manner, are the right macroscopic “observables”, i.e., the variables which parametrize the macroscale 
dynamical behavior of the system, thus enabling the construction of a reduced macroscale model. 

Such manifold learning algorithms were initially applied to synthetic data sets, to illustrate their geo¬ 
metric properties and flexibility [I11I3- More recently they have been applied to experimental as well as 
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simulation data, enabled by advances in data representation (observers) and metrics 18, 1^, 2^, 21, 22, 231. 
A nontrivial shortcoming of these methods is that, from a geometric perspective, not all eigenvectors are 
guaranteed to parametrize unique directions within the data; some eigenvectors are “repeated eigendirec- 
tions” which describe the same coordinate in the intrinsic geometry of the data. Identifying these eigenvec¬ 
tors is critical for obtaining a parametrization of the system which captures the true dimensionality of the 
macroscale dynamics. This task is often done manually, and few methods have been proposed to automate 
the identification of the unique eigendirections 


24|. 


In this paper, we propose an algorithm to automatically identify such unique eigendirections using local 
linear regression [2^. We first demonstrate and validate our algorithm on synthetic examples, where a closed- 
form solution for the eigenvectors and eigenvalues of the Laplace operator is known. We then consider a 
simulation data set from a stochastic dynamical system modeling cellular chemotaxis . Recent advances 
in observers and metrics, coupled with the proposed approach for identifying the unique eigendirections, 
provide a data analysis pipeline which successfully analyzes this simulation in a purely data-driven manner. 
We will show that this pipeline allows us to detect changes in the dimensionality of the underlying macroscale 
dynamics, which are related to changes in the regime/mode of the system. 


2. Manifold Learning Based on Laplace Operators 


Let 2;(1),..., z{m) € M" denote m state observations sampled from an evolving autonomous dynamical 
system. We assume the n-dimensional observations z(i) lie on a d-dimensional manifold Aid, where d < n. 
We therefore consider the problem of parametrizing the continuous d-dimensional manifold Aid embedded 
in R." from data. For linear hyperplanes, the principal axes parametrize the manifold. However, in the 
case when the manifold is nonlinear, a set of appropriate coordinates is not readily apparent. By using a 
particular manifold learning technique based on the construction of a Laplace operator, namely, diffusion 
maps, we will show how to extract a d-dimensional parametrization of the observations which is consistent 
with the geometry of the manifold 


13|,l]J,ll6|. 


2.1. Eigenfunctions of the Laplace-Beltrami operator 

In recent years it has been noted that the eigenfunctions of the continuous Laplace-Beltrami operator 
provide “good” coordinates for a manifold We start by considering such a continuous setting, where 
rigorous analysis is possible for specific examples. Using such an example, we will demonstrate that the 
existence of repeated eigendirections is inherent to the manifold learning setup based on Laplace operators, 
and that the identification of the unique eigendirections requires some additional effort. The fact that these 
challenges arise in the continuous setting implies that, even in the limit of infinite data, such repeated 
eigendirections still pose a problem for data analysis. 

To review and illustrate why these eigenfunctions provide appropriate coordinates, consider a two- 
dimensional strip with edge lengths Li and L 2 . The eigenvalues of the Laplace-Beltrami operator with 
Neumann boundary conditions are given by 


dki,k2 — 



for ki,k 2 = 0,1, 2,..., and the corresponding eigenfunctions are 


4^ki ,k2 — cos 


f fclTTZl 


COS 


/ k2TfZ2\ 

\ L2 ) 


( 1 ) 

( 2 ) 


where Zi and Z 2 denote the two coordinates of the strip 0- We note that the eigenfunctions ^ 1,0 = 
cos (ttzi/Li) and = cos{'kz 2 /L 2 ) are one-to-one with the zi and Z 2 coordinates, respectively, on the 
domain, and therefore provide a parametrization of the underlying manifold (see Figure ITal) . Furthermore, 
the corresponding eigenvalues /ii_o and /io,i provide an estimate of the relative magnitude of Li versus L 2 '. 
as the ratio between Li and L 2 increases, the gap between and /io,i also increases (this will be discussed 
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Figure 1: (a) Two-dimensional continuous strip colored by the eigenfunctions (pi^o = cos (ttzi/Li), and 
00,1 = cos (7rz2/i2)- (b) Data, uniformly sampled from a two-dimensional strip, colored by the first and 
fourth (non-trivial) eigenvectors of the discrete Laplacian. (c) Data, sampled from a Gaussian distribution 
in Zi and sampled uniformly in Z 2 , colored by the first and third (non-trivial) eigenvectors of the discrete 
Laplacian. Note that in all cases we uncover parametrizations which are one-to-one with zi and Z 2 on the 
domain. 


further in Section 1321 ). The analytic form of the eigenfunctions in © illustrates the two issues we address 
in this paper. First, Zi and Z 2 are not necessarily decoupled in subsequent eigenfunctions, and a proper 
parametrization of the manifold is not necessarily given by the d eigenfunctions associated with the smallest 
d eigenvalues. Second, eigenfunctions with fci -|- ^2 > 2 do not describe any additional directions intrinsic to 
the strip geometry; we will refer to these as “repeated eigendirections,” and we will refer to the eigenfunctions 
with ki + k 2 = 1 as “unique eigendirections.” Clearly, this problem of repeated eigendirections also arises 
for more curved of more than two variables. We note that, although the eigenfunctions can only be written 
analytically for very special cases, it has been observed empirically that they often provide appropriate 
coordinates to usefully parametrize more complicated, nonlinear manifolds. 


2.2. Discrete approximation of the Laplace-Beltrami operator: diffusion maps 

In most applications, we are not given a description of the continuous manifold on which the data lie. 
Instead, we are given data sampled from the underlying manifold, and the parametrization of the manifold 
needs to be uncovered from the data. This can be accomplished by constructing a matrix which approxi¬ 
mates the Laplace-Beltrami operator. It was shown in |I5l | that, in the limit of infinite data, this discrete 
Laplacian matrix constructed from data converges pointwise to the continuous Laplace-Beltrami operator 
on the manifold. As a result, the eigenvectors of the discrete Laplacian approximate the eigenfunctions of 
this continuous operator. 

Given observations z{l),..., z{m) G Add, we first construct the weight matrix W G with 


Wij = exp 


\\zii) - 


i,j = 1,...,TO, 


( 3 ) 


where || • || denotes the appropriate norm for the observations, and e is a characteristic distance between the 
observations. The kernel’s scale e can be chosen using several heuristics 22, we often take e to be the 
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median of the pairwise distances between the data points. The underlying assumption is that, even in cases 
where the samples embedded in the high dimensional space lie on a highly nonlinear manifold, within e-local 
neighborhoods, the Euclidean distance respects the tangent plane to the manifold and thus locally conveys 
meaningful relationships,. We then construct the diagonal matrix D G with Du = 

form the matrix W = D~“WD““, where 0 < a < 1. Next, we construct the diagonal matrix D G 
with Du = Wij, and the matrix A = D~^W. 

If the data ^(l),..., z(m) are sampled from A4d with some density q, then, for e —>■ 0 and m —>■ oo 
(with the appropriate rates), the discrete matrix converges to the following continuous limit operators with 
Neumann boundary conditions (as discussed in the previous section) 


4(1 - AU - 2VU ■ Vfi, 

a = 0 

(4) 

4(1 - AU VV - VI/ • 

a = 1/2 

(5) 

1(i-a)0^v2(/>, 

a = 1 

(6) 


where U = — log q. The different limit operators, depending on the choice of a, imply that nonuniform sam¬ 
pling on the manifold may have different effects via the potential U in (H])-®. In particular, by setting a = 1, 
one can factor out the density effects in the weight matrix, and the discrete Laplacian matrix approaches 
the Laplace-Beltrami operator on the manifold Aid- Therefore, the eigenvectors , (pm-i of A ap¬ 

proximate the eigenfunctions of the Laplace-Beltrami operator on Aid, and the eigenvalues ^o, Mi, ■ • ■, Mm-i 
of A are related to the eigenvalues of the continuous operator by 

= exp (V 


As discussed previously, the eigenfunctions provide a parametrization of the manifold, such that 
yields the embedding coordinate for z{i), where r G N is a parameter (in our examples, we will take 
r = 0). The parameter r corresponds to the number of iterates of the Laplace operator, or in other words, 
the number of diffusion steps (for more details, see [l^). This mapping z{i) i—>■ (p,[0i(i ),..., 
is the standard diffusion maps embedding 14, 11, where we order the eigenvectors such that |/io| > |mi| > 
■ • • > |Mm-i|- Because the matrix A is row-stochastic -^ij — 1), Mo = 1 and ipo is a trivial constant 
vector. The distance induced by this mapping is called the standard diffusion distance. 


m —1 

z{j)) = X! “ ^k{3)f ■ ( 8 ) 

fc=i 

Figures fTbUIcl shows data sampled from a strip, colored by eigenvectors of A. In cases of both uniform and 
nonuniform sampling, the selected eigenvectors are one-to-one with zi and Z 2 , and thus parametrize the 
manifold. Although we have considered a very simple example for illustrative purposes, common practice is 
to use these tools for high-dimensional, nonlinear data sets. 

From the previous section, we know that the eigenfunctions with (^ 1 ,^ 2 ) = (1,0) and (/ci,fc 2 ) = (0,1) 
provide embedding coordinates for the manifold; these two eigenfunctions are both uncoupled and not 
repeated. From o, we see that sorting the eigenvectors by the magnitude of the corresponding eigenvalues 
implies that these two eigenvectors are guaranteed to appear before any coupled or repeated eigendirections. 
However, these eigenvectors are not guaranteed to appear as the first two (non-trivial) eigenvectors, as 
harmonics of the first eigendirection (i.e., cos (n7rzi/Li) with n > 1) could appear before the second. We 
note that, in contrast to our simple illustrative example, for most data sets of interest, the coordinates for 
the underlying manifold are unknown and cannot easily be obtained from the coordinates of the original 
data, so that identifying which eigenvectors correspond to unique eigendirections is nontrivial. 


4 





Figure 2: Data, uniformly sampled from a two-dimensional strip, colored by the first four (non-trivial) 
eigenvectors from diffusion maps. Note that the first and fourth eigenvectors are one-to-one with zi and Z 2 , 
respectively. However, the second and third eigenvectors are higher harmonics of the first eigenvector and 
do not capture any additional structure within the data set. 


3. Identifying the Informative Eigenvectors 

Common practice is to order the eigenvectors by the magnitude of the corresponding eigenvalues, as¬ 
suming that the leading eigenvectors provide a parametrization of the underlying manifold. However, as 
discussed in the previous section, some eigenvectors are higher harmonics of previous eigenvectors and do 
not describe new directions in the data set [ 2 ^. For the case where Aid is a 2-dimensional strip with edge 
lengths Li > L 2 , recall that the eigenfunctions = cos (tt^i/Li) and (/)o,i = cos (7rz2/F2) provide embed¬ 
ding coordinates for the manifold Aid- However, these two eigenvectors 0i_o and are not guaranteed to 
correspond to the two smallest (non-trivial) eigenvalues (the smallest eigenvalue will always be /io,o = 0 and 
correspond to a constant eigenfunction; these eigenvalues are related to the eigenvalues p,k of the discrete 
operator via ©). In fact, if Li > 2 L 2 , then jl 2 ,o < P-o.i, and so the second (non-trivial) eigenvector (when 
the eigenvectors are ordered by their corresponding eigenvalues) will be a “repeated eigendirection” of the 
first and still parametrize zi (see Figure [5]). Automatic detection of which eigenvectors are harmonics of 
each other is clearly useful. Utilizing only the eigenvectors (j)i which correspond to unique eigendirections 
yields the most parsimonious representation of the data. We will show how we can automatically detect 
these eigenvectors to obtain a meaningful representation of the data, and that using only these eigenvectors 
yields a reduced embedding which is equivalent to the standard diffusion maps embedding. Furthermore, 
when the data is uniformly sampled from the underlying manifold, the corresponding eigenvalues provide 
us with an estimate of the relative lengths of the data set along these unique eigendirections. 


3.1. Algorithm: local linear regression 

Given the eigenvectors tpi, (j) 2 ,..., 4>m-i S M’", we would like to automatically deduce which ones capture 
new directions in the data, and which ones are merely repeated eigendirections. This problem was addressed 
previously in 2^ by performing successive iterations of diffusion maps, interspersed with advection along 
the first eigendirection at each iteration. However, both the advection procedure and the successive eigende- 
compositions are expensive and soon become intractable for larger data sets. Here, we propose an alternative 
approach to address the problem of repeated eigendirections motivated by simple trigonometric arguments 
(for example, the repeated eigendirection cos 2x can be written as a function of the unique eigendirection 
cosx). We therefore attempt to fit a function /(t/fi,..., ^fe_i) to if the resulting fit is accurate, we 
assume 4>k is a repeated eigendirection of (/)i,..., (fik-i- We use a local linear function 


~ otk{i) + /3j(i)$fc_i(i) 


( 9 ) 


as our functional approximation, where $fc_i(i) = ... (/)/c-i(i)]^, ak{i) G R, and /3/c(f) € 

The coefficients ak(i) and /3fc(i) are not constant because we use a local linear fit in the k — 1-dimensional 
$^-1 space, and so the coefficients change as a function of the domain. 

At each point fl>fc_i(f), we approximate <j)k{i) by fitting a local linear function using the remaining m — 1 
data points. We solve the following optimization problem 

dfe(f),/3/c(f) = argmin V A:($fc_i(z),$fc_i(j)) {(j)k{j) - {a + (3^^ k-iij)))^ ■ (10) 
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where K is a kernel weighting function. We use a Gaussian kernel, 


K{^k-i{i), $fc-i(j)) = exp 




(i) - <^k-i 



( 11 ) 


where Creg is the kernel scale for the regression algorithm. We typically take Creg = M/3, where M is the 
median of the pairwise distances between as we empirically found this choice to yield good results. 

We then define the normalized leave-one-out cross-validation error for this local linear fit as 


rk = 


\ 


Z)r=i - (afe(i) -I- ^k{iV^k-i{i 

Eti f 


( 12 ) 


Note that a small value of rk implies that 4>k can be accurately approximated from (/i,..., 4>k-i- We reason 
that a small value of rk implies that (j)k is a harmonic of previous modes, i.e., a repeated eigendirection, and 
conversely, a large value of rk indicates that (jjk parametrizes a new unique eigendirection in the data. We 
choose to set ri = 1. The error in (fT^ can easily be computed (see Section 5.4 of [1^; code is also available 
at http://ronen.eew.technion.ac.il/sample-page/journal/). 

We propose using only those eigenvectors for which rk is large to embed the data, as this will yield a 
more parsimonious representation. However, it is important to note that such an embedding also preserves 
certain features of the standard diffusion map embedding. Let I = ■ ■ ■ ,*d} denote the indices of the 

identified unique eigendirections (i.e., rq,..., are large). Under the assumption that (j)j = f (t/i^,..., ) 

for j ^ J, where / is Lipschitz continuous with Lipschitz constant K, one can show that 

. , ^ 2 ^ - ^Dl(z{i),z(j)) < iMi) - M) f < Dl{z{i),z{j)). (13) 

^ + ^ l^k^il^k kei 


Therefore (for finite data, or provided the eigenvalues fj-k decay sufficiently fast in the limit of infinite data), 
the distance induced by the d eigenvectors i/q,..., identified as parametrizing unique eigendirections is 
equivalent to the standard diffusion distance. We will refer to this distance 


Dl{z{i), z{j)) = ^ {(jjkii) - Mj)f 

kGl 


(14) 


as the reduced diffusion distance, and the embedding obtained from the corresponding eigenvectors as the 
reduced diffusion maps embedding. 


3.2. Estimating the “relative lengths” 

For a two-dimensional strip, once the eigenvectors which parametrize unique eigendirections have been 
determined, the corresponding eigenvalues can be used to calculate the relative lengths along these directions. 
For more general manifolds (at least for the case of uniform sampling), we postulate that the eigenvalues can 
still provide a measure of the relative significance of the unique eigendirections. Let ,... denote the 

eigenvalues corresponding to eigenvectors which parametrize unique eigendirections (i.e., those eigenvectors 
where ri. is large). For the strip example, from ([IJ and ([7]), we propose to approximate the relative lengths 
Lj along the manifold by 


Lj (X 


1 

^-log/x,^. 


(15) 


These lengths can then be used to evaluate how many components are required to retain most of the 
information within the data set, similar to the eigenvalues in principal component analysis. 


3.3. Illustrative examples 

We demonstrate our proposed approach on three synthetic data sets. The first data set is the two- 
dimensional strip discussed previously, where the eigenvalues and eigenvectors are known analytically. The 
second and third data sets involve nonlinear manifolds which demonstrate the flexibility of our approach. 
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Figure 3: Data sets (top) and eigenvalue spectra from diffusion maps analysis (bottom) for strips with (a) 
Li = 2, L 2 = 1, (b) Li = 4, L 2 = 1, (c) Li = 8, L 2 = 1. The empirical eigenvalues are plotted in blue, 
and the analytical eigenvalues are plotted in red. The bars are colored by Vk, as defined in (USD- From the 
eigenvalues which are identified as parametrizing unique eigendirections (indicated by the darker blue bars), 
the estimated length ratio from (IT^ is (a) 2.2, (b) 4.1, (c) 8.7. 


3.3.1. Strip 

We consider three different two-dimensional strip data sets. Each data set contains m = 2000 data points 
uniformly sampled from the strip. Figure [3] shows the data sets and the diffusion maps eigenspectra. The 
eigenvalues are colored by the leave-one-out cross-validation error r*, as defined in (na); a small value of 
indicates that the corresponding eigenvector is a repeated eigendirection, while a large value of r*, indicates 
that the corresponding eigenvector describes a new direction in the data. 

The eigenvalues are consistent with the known analytic eigenvalues of the Laplacian (see ([T} and (I?])), 
shown in red. Furthermore, the two unique eigendirections can easily be identified, since their corresponding 
regression error is large. As expected, the gap between the two meaningful eigenvalues increases as 
the strip becomes longer. Using m, we can accurately estimate the relative lengths of the two unique 
eigendirections in each data set. 

3.3.2. Swiss roll 

Our second illustrative example consists of two different Swiss roll data sets. The data are sampled 
according to 

{zi,Z 2 ,Z 3 ) = {OcosO,9sin0,ht) (16) 

where 9 is sampled such that zi , Z 2 are uniformly sampled along the arclength of the spiral, and t is uniformly 
sampled from [0,1]. Note that, in this example, zi,Z 2 ,Z 3 are the original data coordinates; however, the 
data lie on a two-dimensional manifold parametrized by the 9 and t. The height of the first Swiss roll is 
h = 40, while the height of the second is h = 20. Each data set consists of m = 1500 points, shown in 
Figures |43 and l4el 

Figures l4bl and l4fl show the eigenvalue spectra from the analysis of the two data sets. Similar to Figure[31 
the bars are colored by the leave-one-out cross-validation error, where a small value of rj, indicates that 
the corresponding eigenvector is a repeated eigendirection. From these plots, one can conclude that the 
first two eigenvectors (/>i and (j )2 parametrize the first data set, while (j)i and ^5 parametrize the second. 
Figures l4c iMllgl and liiil show the two data sets, colored by the two eigenvectors identified as parametrizing 
the unique eigendirections. As expected, these eigenvectors are one-to-one with the arclength along the 
spiral, and the height of the Swiss roll. 
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Figure 4: Swiss roll example, (a) Data set 1; h = 40. (b) Eigenvalue spectrum from the diffusion maps 
analysis of data set 1. (c) Data set 1, colored by the first diffusion maps eigenvector, (d) Data set 1, colored 
by the second diffusion maps eigenvector, (e) Data set 2\ h = 20. (f) Eigenvalue spectrum from the diffusion 
maps analysis of data set 2. (g) Data set 2, colored by the first diffusion maps eigenvector, (h) Data set 2, 
colored by the fifth diffusion maps eigenvector. 


3.3.3. Torus 

For the third example, we consider a torus defined by 

{zi,Z 2 , Z 3 ) = ((ri + r 2 cos 02) cos0i, (ri + r 2 cos02) sin0i,r2 sin02) (17) 

where ri > r 2 are the outer and inner radii, respectively, and 61 and 02 are uniformly and independently 
sampled from [0,27r). In this example, zi,Z 2 ,Z 3 are the coordinates of the data; however, we expect to 
obtain from our analysis two eigenfunctions (sin 0 i and cos 0 i) which parametrize the outer circle, and 
two eigenfunctions (sin 02 and cos 02 ) which parametrize the inner circle. Intuitively, increasing ri can be 
viewed as analogous to increasing the ratio of Li to L 2 in the strip example and makes the eigendirections 
which parametrize the outer circle (cos 0 i and sin 0 i) more dominant compared to the eigendirections with 
parametrize the inner circle. Figure [5] shows the eigenspectra from the analysis of three different tori, 
colored again by the leave-one-out cross-validation error r^. We observe that as the torus becomes thinner 
(ri increases), the second pair of eigenvalues corresponding to unique eigendirections (corresponding to sin 02 
and cos 02 ) moves farther down in the spectrum. 

4. Chemotaxis: A Case Study 

Our main motivation for identifying the unique eigendirections is to facilitate the analysis of complex 
data sets where the underlying dimensionality and structure is not readily apparent. In data from complex 
dynamical systems, noisy microscopic behavior often gives rise to coherent macroscopic dynamics. In general, 
this mapping from microscale to macroscale is not always obvious. We will show how we can use our proposed 
methodology to extract a parametrization of data collected at the microscale which is consistent with a 
certain macroscopic behavior, without any a priori knowledge of the appropriate microscopic or macroscopic 
model. Furthermore, determining the true dimensionality of such a parametrization by identifying the 
unique eigendirections reveals the requisite dimensionality of a reduced model which captures the relevant 

8 


















































Figure 5: Tori data sets (top, see (fTTl) ') and corresponding diffusion maps eigenvalues (bottom) for (a) 
ri = 3, r 2 = 1. (b) ri = 5, r 2 = 1. (c) ri = 10, r 2 = 1. The tori are parametrized by two angles: 9i, 
which parametrizes the large outer circle, and 02, which parametrizes the smaller inner circle. In all three 
data sets, the first two eigenvalues/eigenvectors correspond to sin0i and cosdi. The second pair of unique 
eigendirections (corresponding to sin 02 and cos 02) are captured by components 7 and 8, 11 and 12, and 15 
and 16, respectively. 


macroscopic behavior. Knowing appropriate macroscopic variables and the true dimensionality can help 
inform modeling efforts and aid in positing useful macroscale models. 

Our model problem describes the process of cellular chemotaxis [2^, where biological cells exhibit co¬ 
herent macroscopic dynamics regulated by extracellular sensed signals in order to accomplish tasks such as 
finding food or navigatiM away from toxins. Several microscopic models have been proposed to describe 
chemotactic dynamics |26l [SOj. We will analyze one such model described by a one-dimensional velocity 
jump process [29|. This specific example has an analytic macroscopic description in which the dynamics of 
a large ensemble of cells depend on the value of a single system parameter. This description serves as a 
“ground truth” and will allow us to validate our results which are obtained in an unsupervised manner. 

Thus far, we only considered synthetic data sets for which the Euclidean distance between data points 
served as an informative metric. Here, we will show that our approach, when utilizing the appropriate 
statistical observers and ajfinity metric between pairs of observations, uncovers a parametrization of the 
microscopic data which is consistent with the macroscopic model. Furthermore, we will show that changes 
in the “mode” of dynamical behavior resulting from changes in the system parameters can be automatically 
detected. 

4.I. Problem description 

The microscopic model consists of a collection of N cells (for our simulations, we take N = 1000) whose 
states are defined by their positions and velocities on a line, and the dynamics of each cell are governed by 
a stochastic process. Let Xi{t) and Vi{t) denote the position and velocity, respectively, of cell i at time t. 
The velocity of each cell is ±s, where s is a (fixed) speed. We initialize the cells such that 


a:i(0) = 0 

P{ui(0) = -l-s} = p 


(18) 


where 0 < p < 1 is the probability of a cell initially moving to the right. The velocity of each cell randomly 
switches between ±s following an (independent) Poisson process with rate A (this switching is physically 
controlled by extracellular signals). For our specific simulations, we set s^/A = 1, which is consistent with 
the analysis presented in (^ . We note that we have chosen a very specific one-parameter family of initial 
conditions which lead to simple dynamics and allow us to illustrate our main points; however, our analysis is 
not restricted to such specific cases and more complex sets of initial conditions could be used. Each data set 
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Figure 6: Embeddings using the first two diffusion maps eigenvectors computed from simulation data of the 
velocity jump process with t^ax = 10, At = 1, and (a-c) A = 1, s = 1, (d-f) A = 400, s = 20. The distances 
used in the diffusion maps kernel are the earth mover’s distances between the histograms of cell positions. 
The data are colored by (a, d) p, the initial probability of a cell moving to the right, (b, e) t, time, and (c, f) 
Ep'^ — Ep~, the difference between the average position of the left- and right-moving cells. Representative 
histograms of the cell positions are shown for selected data points. 


consists of 10 stochastic simulations, with initial conditions uniformly chosen such that 0.1 < p < 0.9. We 
allow each simulation to evolve for tmax time units, and record the states of the N cells every At time units; 
we use data with t > 0 for analysis. We will show that the parameter tots = tmax/N, which determines the 
time scale of observation, is of critical importance when we address the limiting asymptotic behavior. 


4-2. Observers and metrics 

In general, manifold learning techniques have two essential components. One is the appropriate observers 
of the system state. These observers should be informative as to the state of the system, as well as insensitive 
to noise in the system. The second is a distance metrie between the observations that captures a notion of 
locality: observations which we perceive to be similar should yield a small value for this distance. 

For the chemotaxis example, we use histograms of the cell positions as observers. Histograms are invariant 
to the indexing of the cells, while retaining information about the relative spatial locations of the cells. 
Histograms are also robust to noise [2^. Instead of the standard Euclidean distance, we use the earth 
mover’s distance (EMD) [l^ as the metric between pairs of histograms. Conceptually, the EMD measures 
how much “work” it takes to transform one probability density into another. It therefore not only considers 
where the densities are inconsistent, but also how far apart the inconsistencies are from each other. Although 
the brute-force computation of the EMD is computationally expensive, there has been a plethora of work 
in developing efficient algorithms for its approximation 331. Eor the specific case of spatially one- 


dimensional data, the EMD is equivalent to the Li-norm between the cumulative distribution functions of 
the data 


34j . which can be approximated from histograms as 


\z{i) - z{j)\\EMD 


E 








(19) 


the histograms z{i), z{j) are defined on n equally-spaced bins in K. (for our simulations, we will take n = 32), 
and Zk denotes the bin. Here, we choose to only use the positions of the cells in the distance computation; 
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Figure 7: Analysis of three different chemotaxis simulation data sets, (a) A = 100, s = 10. (b) A = 1600, 
s = 40. (c) A = 6400, s = 80. We set t^ax = 10 and At = 1, and the distances used in the diffusion 
maps kernel are the earth mover’s distances between the histograms of cell positions. For each data set, the 
eigenvalue spectrum, colored by r^, is shown in the top row. From the spectra, we can see that the first two 
components are informative for (a), that components 1 and 3 are informative for (b), and that components 
1 and 4 are informative for (c). The corresponding reduced diffusion maps embeddings are shown in the 
middle row. For comparison, the standard diffusion maps embeddings using the first two components are 
also shown in the bottom row in (b) and (c). 


utilizing both the positions and velocities produces qualitatively similar results (not shown). We note that 
although histograms are not essential to the theory of EMD, they make the computation practical. 

4-3. Results 

4-3.1. Identifying the unique eigendirections 

Figure [5] shows the results of analyzing two sets of chemotaxis simulations. One set of simulations 
(Figures IHaUHB is for a relatively small value of A, while the other set (Figures I6dlf6f1l is for a larger value 
of A. For both sets of simulations, the macroscopic variables p, which controls the initial distribution of cell 
velocities, and t, the time, are well-correlated with the eigenvectors cfi and 02 - In particular, p characterizes 
the fraction of initially right-moving cells (alternatively, the initial flux of the right-moving cells). The 
dominant coordinate (fi is correlated with p for the small A case (Figure [Sal), and correlated with t for the 
large A case (Figure jSe]), indicating that the relative importance of p and t changes in the two simulations. 

As illustrated in the synthetic data sets we discussed previously, the two leading eigenvectors are not 
guaranteed to correspond to unique eigendirections. Figure [7] shows the results of analyzing three different 
simulations which span a larger range of A values compared to FigurejSl We see from the eigenspectra and the 
cross-validation errors rk that the second unique eigendirection moves down in the spectrum as A increases. 
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The two-dimensional reduced diffusion maps embeddings obtained from eigenvectors which parametrize 
unique eigendirections capture the macroscopic variables p and t (we only show the correspondence of the 
two identified unique eigendirections with p in the middle row of Figure [T] the correspondence between 
these eigendirections and t is similarly consistent). In contrast, using the two leading eigenvectors (j)i and (j )2 
produces uninformative embeddings for large values of A, as (f )2 is a repeated eigendirection which does not 
parametrize any new directions in the data. Furthermore, although the eigenvalue spectra do not exhibit 
any large spectral gaps, the gap between the eigenvalues which correspond to new directions in the data 
(indicated by the dark blue) increases with A, indicating that the underlying manifold becomes narrower, 
and the corresponding eigenvectors provide us with an informative two-dimensional parametrization of the 
data. It now becomes clear that looking at the eigenvectors modulo repeated eigendirections is essential for 
extracting an informative parametrization of the data, as well as characterizing the effective dimensionality 
of the data. 


4-3.2. Analytic macroscopic description 

This particular model system can be usefully described through a known analytic macroscopic equation 
that governs the overall system behavior. For a large collection of cells {N oo), the system can be 
described using the evolution of the cell density. Let p{x, t) denote this evolving cell density field, and let 
p~{x,t) and p'^{x,t) denote the densities of the left-moving and of the right-moving cells, respectively. It 
can be shown that, as N ^ oo, the densities satisfy the following set of coupled partial differential equations 
(PDFs) 0: 


dp+ 

dt 


+ = -Ap+ -h \p 

OX 

dp~ . + . - 


( 20 ) 


Alternatively, (1^ can be rewritten as a single, second-order PDF (the telegrapher equation): 


dt'^ dt dx^ 


( 21 ) 


From (EU, we see that, for fixed values of A and s, the macroscopic state of the system (the probability 
density of the cells) is indeed a function of the initial condition (parametrized by p) and the time t. This 
implies that for fixed A and s, the microscopic data in a high-dimensional ambient space (e.g., the positions of 
all N cells) should lie on a two-dimensional manifold parametrized by p (which controls the initial ratio of p^ 
and p~) and t (which describes the evolution); uncovering the low-dimensional structure of the microscopic 
data helps reveal this manifold, leading to a subsequent physically meaningful parametrization of it. This 
is consistent with the results presented in Figures [S] and 0 where the two-dimensional embeddings obtained 
from microscopic simulation data are one-to-one with p and t. 

Guided by the known theory of the model macroscopic PDF system, we consider two asymptotic regimes 
for the simulation. When A —)• 0, the right-hand side of (Eni tends to 0, and (pni) becomes two uncoupled 
wave equations. 


5p+ 

dt 

dt 


, 9p+ 

+ 

OX 


— s- 


d_r 

dx 


= 0 . 


( 22 ) 


When A —>■ oo with s^/A fixed, (1^ approaches the diffusion equation, 


2 


dt 


= D 


d'^p 
dx'^ ’ 


(23) 


where D = s^/A. The above analysis shows that the initial distribution of velocities of the cells (determined 
by p in the microscopic simulations) plays a very different role depending on the value of A. When A —x 0, 
the dynamics are described by the telegrapher equation, and the effect of the initial velocity distribution 
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persists throughout the trajectory. When A oo, the dynamics are described by a single diffusion equation, 
and the initial conditions for the velocity are not especially important, since the velocity distribution quickly 
equilibrates and we see purely diffusive behavior. 

The relative importance oip and t from the analytic description are consistent with the results in Figure[6l 
where p is correlated with the dominant diffusion map coordinate when A is small, and becomes correlated 
with the subdominant coordinate when A is large. Furthermore, in the small A regime (wave equation), 
shown in Figures ISaHSH the points corresponding to small times are more tightly clustered than the points 
corresponding to large times. This is in agreement with the macroscopic model: at small times, the cells are 
more condensed around x = 0, and it is more difficult to distinguish the cells moving to the left from the 
cells moving to the right. On the other hand, at large times, once the cells move away from the origin, this 
separation between left-moving and right-moving cells is clear. For the large A case (diffusion equation), 
shown in Figures I6dl - l6fl we observe that at small times (before the cell velocities have sufficient opportunity 
to switch and equilibrate), the initial velocity distribution p can be well perceived in the embedding in Figure 
I6dl as the initial imbalance in left-right velocities affects the initial displacements. On the other hand, for 
longer times, we observe that the initial distribution p cannot be well perceived in the embedding in Figure 
I6d[ as the velocities have equilibrated and their initial distribution is practically forgotten. Overall, in both 
cases, we obtain, in an unsupervised data-driven manner, a useful low-dimensional embedding of the data. 
Correlating this embedding with potential candidate macroscopic observables (such as time, densities, fluxes, 
and initial condition statistics) helps elucidate an accurate picture of the macroscopic variables that govern 
the system dynamics. 

The analytic macroscopic model suggests that when A is small, the system dynamics can be described 
by the two densities and and when A is large, the dynamics can be described by a single density 
p = p'^ + p~. Figures 15^ and l6fl show the data, colored by the difference in the average position of the 
left- and right-moving cells. Clearly, for small A, both and p~ are required to describe the dynamics, 
and the difference between the two densities is a useful observable in parametrizing the data manifold (and 
thus a good macroscopic descriptor, a good “candidate macroscopic variable”). However, for large A, the 
two densities rapidly equilibrate, and the difference between the two densities is not a useful observable in 
parametrizing the data manifold. 

This analytic macroscopic description allows us to emphasize the importance of using an appropriate 
distance metric. We also analyzed the two sets of simulation data from Figure [5] using the standard Euclidean 
distance between the histograms to compute the distances in We empirically found that there is no 
appreciable correlation between the embedding coordinates and what we know to be meaningful macroscopic 
variables, p and t (the correlations between the embedding coordinates and the governing macroscopic 
variables are all less than 60%, with the correlations for the small A case being less than 20%) with this 
similarity measure. In contrast, the correlations between the embedding coordinates and the macroscopic 
variables when using EMD in the diffusion maps calculation are all greater than 80%. Clearly, using a metric 
which meaningfully describes the distances between observations is essential for obtaining an informative 
parametrization of data. We note again that the metrics we used in this work only employed observations of 
cell positions (and not of cell instantaneous velocities); if the instantaneous velocities are readily measurable, 
they could be incorporated in a useful and possibly even more informative metric for data analysis. 


4 . 3 . 3 . Detecting changes in dimensionality 

Detecting changes in effective dimensionality is essential for the analysis and modeling of (simulated or 
experimentally observed) dynamical systems. The approximate dimensionally of a data set can be estimated 
by examining the two eigenvalues corresponding to unique eigendirections. According to (fT51) . we 

plot 


'log^ 
logAi*2' 


(24) 


when this ratio becomes small, the data are effectively one-dimensional, as the extent of data along the 
second direction is very small compared with the first. 
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Figure 8: Detecting the change in dimensionality. The appropriate ratio the two eigenvalues (see (1241) 1 which 
correspond to unique eigendirections is plotted as a function of tabs and A. Each data point is the average 
eigenvalue ratio over three replicate data sets, and in each simulation, At = tma^/lO. The distances used 
in the diffusion maps kernel are the earth mover’s distances between the histograms of cell positions. The 
curve tabs = 1/A is shown in white. 


Figure [5] shows the eigenvalue ratio in (El as a function of A and the time scale of observation tabs = 
tmax/N. From this eigenvalue ratio, we can easily detect changes in the estimated dimensionality as we 
vary the relevant parameters. For small A and/or short tabs, the data are effectively two-dimensional, as 
both p and t are important to the observed dynamics. However, when A and/or tabs is large, the system 
rapidly equilibrates and the data are effectively one-dimensional, parametrized by time. For this specific 
example, these changes are consistent with the analytically-predicted shift in the dynamical behavior from 
the telegrapher equation to the diffusion equation. From (1^ . we can see that tabs, the time scale of 
observation, should be larger than the time scale 1/A for the diffusion equation limit in (E^ to hold; we 
therefore expect a transition in the dynamical behavior at 

tabs ~ (25) 

This curve is also plotted in Figure |5] and is consistent with the transition predicted by the diffusion maps 
eigenvalues. 

5. Conclusions 

This paper addresses repeated eigendirections, a critical issue in applying diffusion maps to the analysis 
of complex dynamic data. We show that, using local linear regression, we can automate the detection of 
eigenvectors corresponding to unique directions in the geometry of the data. From this detection, we can 
then obtain a reduced dijfusion maps embedding which is both parsimonious and induces a metric which is 
equivalent to the standard diffusion distance. We showed that this algorithm enables us to more fruitfully 
analyze data from a complex dynamical system, enabling the extraction of good reduced coordinates and the 
detection of changes in the dimensionality of the macroscopic dynamics. We are confident that the proposed 
methodology will be helpful in the analysis of high-dimensional data sets for which the dimensionality of 
the underlying manifold is unknown. 
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